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ABSTRACT 

Motivation: Identifying altered pathways in an individual is important 
for understanding disease mechanisms and for the future application 
of custom therapeutic decisions. Existing pathway analysis techniques 
are mainly focused on discovering altered pathways between normal 
and cancer groups and are not suitable for identifying the pathway 
aberrance that may occur in an individual sample. A simple way to 
identify individual's pathway aberrance is to compare normal and 
tumor data from the same individual. However, the matched normal 
data from the same individual are often unavailable in clinical situation. 
Therefore, we suggest a new approach for the personalized identifi- 
cation of altered pathways, making special use of accumulated normal 
data in cases when a patient's matched normal data are unavailable. 
The philosophy behind our method is to quantify the aberrance of an 
individual sample's pathway by comparing it with accumulated normal 
samples. We propose and examine personalized extensions of path- 
way statistics, overrepresentation analysis and functional class scor- 
ing, to generate individualized pathway aberrance score. 
Results: Collected microarray data of normal tissue of lung and colon 
mucosa are served as reference to investigate a number of cancer in- 
dividuals of lung adenocarcinoma (LUAD) and colon cancer, 
respectively. Our method concurrently captures known facts of 
cancer survival pathways and identifies the pathway aberrances that 
represent cancer differentiation status and survival. It also provides 
more improved validation rate of survival-related pathways than when 
a single cancer sample is interpreted in the context of cancer-only 
cohort. In addition, our method is useful in classifying unknown samples 
into cancer or normal groups. Particularly, we identified 'amino acid 
synthesis and interconversion' pathway is a good indicator of LUAD 
(Area Under the Curve (AUC) 0.982 at independent validation). Clinical 
importance of the method is providing pathway interpretation of single 
cancer, even though its matched normal data are unavailable. 
Availability and implementation: The method was implemented 
using the R software, available at our Web site: http://bibs.snu.ac.kr/ 
ipas. 

Contact: tspark@stat.snu.ac.kr or namhuh@samsung.com 
Supplementary Information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION 

Cancer arises from normal cells and can evolve to become ma- 
lignant, metastatic and/or resistant to therapy. The analysis of 
altered pathways in an individual cancer patient may help to 
understand the disease status and suggest customized anticancer 
therapies. 
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It is straightforward to compare the molecular profile of an 
individual's tumor and normal cells to discover molecular aber- 
rances specific to his/her cancer. However, it may not be feasible 
in the current clinical practice environment to perform a meta- 
static tumor biopsy at the time of treatment resistance in patients 
with advanced cancer (Dancey et al., 2012). A case study of 
custom-tailored medicine based on an individual's genome and 
transcriptome highlights this limitation (Jones et al., 2010). 
A patient's tumor had metastasized to the lung after surgery at 
the primary site. A biopsy from his lung tumor was analyzed by 
mutation and transcription profiling; however, the patient's 
normal lung tissue was not biopsied. Because there was no 
matched normal tissue, messenger RNA (mRNA) expression in 
the patient's own blood and information collected from various 
normal tissues were used to identify differentially expressed genes 
(DEGs). The results of pathway analysis based on DEGs, inte- 
grated copy number variation and mutation information led the 
doctor to change the patient's drug treatment, and the disease 
was stabilized for 3 months. 

Although the personalized interpretation of pathways can be 
demanding, most current pathway analyses have been developed 
to investigate deregulated pathways between two phenotype 
groups. Khatri et al. (2012) classified these methods into three 
types: overrepresentation analysis (ORA), functional class scor- 
ing (FCS) and a pathway topology (PT)-based approach. 

ORA approaches typically apply an arbitrary threshold value 
(e.g., fold change >2 or P<0.05) on gene expression to assess 
whether the number of genes beyond threshold are significantly 
over- or underrepresented in the given pathway. There are two 
drawbacks to ORA. First, it uses only the most significant genes 
and discards others, thus resulting in information loss for mar- 
ginally significant genes (Breitling et al., 2004). Second, it con- 
siders only the number of genes and does not consider the 
magnitude of expression changes, leading to information loss 
regarding the importance of genes (e.g. a gene with a fold 
change of 2.01 and a gene with a fold change of 4 are considered 
equally). Unlike ORA, FCS methods do not discard genes with 
an arbitrary threshold but use all available genes, which is an 
improvement over ORA (Tian et al., 2005). PT methods are 
essentially based on FCS methods with the addition that they 
consider network topology information. They compensate for 
the common limitation of ORA and FCS in reporting 
false-positive gene sets due to sets of overlapping genes. In our 
article, we focus on ORA and FCS methods, extending and 
implementing each for personalized pathway analysis. 

There are two exceptional studies examining individualized 
pathway analysis (Drier et al., 2013; Vaske et al., 2010). 
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PARADIGM is a tool that infers a pathway status by using 
known functional structures. The method models the functional 
structure of pathway as a set of interconnected variables, where 
the variables are omic objects such as DNA, mRNA and protein, 
where the interaction between variables describes the functional 
status of a pathway. PARADIGM may perform better with 
multiple omics, as it uses known functional relationships between 
a gene or inter-gene DNA and protein. Hence, it might not per- 
form well with single layer omic data, such as from niRNA 
microarrays. 

Drier et al. (2013) proposed a personal pathway deregulation 
score (PDS), which represents the distance of a single cancer 
sample from the median of normal samples on the principal 
curve. To calculate PDS, they reduced the dimensions by princi- 
pal component analysis and found the best principal curve, using 
entire cohort samples containing both nomal and/or different 
stages of cancers. Drier's method performs better than 
PARADIGM in the mRNA only datasets of brain and colon 
cancers. Calculating PDS requires data dependent preprocessing 
steps, including selecting the number of principal components to 
be used and filtering out noisy gene data to obtain optimized 
principal curves. PDS fully uses whole cohort data to interpret an 
individual's pathway, which can be a drawback in that it requires 
a number of cohort data to extract principal curve to interpret a 
single patient data. It has a limitation to interpret a single sample 
such as a patient's recurrent tumor that is not accompanied with 
cohort dataset to extract the principal curve. 

Our proposed method is based on the comparison of one 
cancer sample with many accumulated nomial samples (we use 
'nRef to refer to the accumulated normal samples) that is dif- 
ferent from the previous studies in following sense. The proposed 
method is suitable to adopt single-layer omics data and expend- 
able to interpret a patient in the context of many published or 
user-defined pathway gene sets. PARADIGM has less freedom 
in terms of data and gene sets, as it prefers multi-layered omics 
data and requires predefined functional structure among omics 
objects. Unlike PDS, which extracts the principal curve from 
entire cohort data, our method does not assume an individual 
sample belongs to a cohort. We introduce using accumulated 
nomial tissue data as a reference. This is a simple and biologic- 
ally intuitive guideline in such a case to interpret a single sample 
that lack cohort data. 

Our method provides a series of analysis steps, which consists 
of four parts: data processing, gene-level statistics, individualized 
pathway aberrance score (iPAS) and a significance test. To dis- 
cover the most feasible method for iPAS, we extend existing 
pathway analysis techniques, namely, ORA and FCS, to prop- 
erly reflect the nature of testing one cancer to many normal 
samples. 

To demonstrate that iPAS captures biologically and clinically 
relevant information in a sensible, valid and useful manner, we 
apply it to samples of lung and colon adenocarcinoma. We show 
that our representation generates clinically relevant stratifica- 
tions and outcome predictors, which would not have been 
achieved when the same data are analyzed by the conventional 
method that does not use accumulated normal data. 

Our empirical study suggests two different strategies, depending 
on the biological question that iPAS is focused on. In the case of 
cancer diagnosis, a method that uses the inter-gene correlation 



structure of the accumulated normal samples performs best. 
In the case of cancer prognosis, a simple averaging of all 
member genes' standardized gene expression values performs best. 

2 METHODS AND MATERIALS 

2.1 Gene expression data 

We built nRef by the manual curation of data obtained from NCBI GEO 
(Barrett et al., 2012). Microarray data of adjacent normal tissues obtained 
from patients undergoing surgery were selected to serve as the nRef. Data 
from biopsied samples, primary cultures of normal tissues and post- 
mortem donors were not included in the nRef We collected 120 nRef 
for lung, 60 from GSE19804 (Lu et al., 2011), 27 from GSE7670 (Su 
et al, 2007) and 33 from GSE10072 (Landi et al, 2008). Samples came 
from individuals with variable smoking histories and different ethnic 
backgrounds. We collected 101 nRef for colon, concentrating on 
normal mucosa tissue samples from six datasets available at GEO. To 
evaluate the effectiveness of our method in survival analysis, we used 
Beer's data of 442 lung adenocarcinomas (LUADs) (Beer et al., 2002) 
to discover survival-related pathways and validated the associations of 61 
LUAD samples of GSE8894 (Lee et al., 2008). The pathway based iden- 
tification of LUAD were tested on 120 cancers and 120 normal samples 
of GSE19804. GSE7670 and GSE10071. Further validation was con- 
ducted with 48 cancers and 35 normal samples collected from 
GSE19188 (Hou et al, 2010) and GSE31547. For patient stratification 
by colon cancer differentiation status, we used 566 microarrays of 
GSE39582 (Marisa et al, 2013), which provided in a separate inanner, 
443 for discovery, 123 for validation. GSE17536 (Smith et al, 2010) was 
also used for validation. 

2.2 Pathway data 

Information froin gene sets representing biological pathways were ob- 
tained from REACTOME (Croft et al, 2011), which are also provided 
in the Molecular Signature Database (Subramanian et al, 2005). 
Pathways with small number of genes are more easily understood by 
human experts. We decided to filter out pathways of which gene set 
size is >97. The cutoff covers at least 80% of contents of each public 
pathway resources. Of 674 pathways in REACTOME, 583 pathways 
(86.7%) remained after filtering by the gene set size. 

2.3 Individualized analysis using the nRef 

The aim of our approach is to identify altered pathways in an individual 
by making use of the nRef A scheinatic diagram of our method of 
individualized pathway analysis is described in Figure 1, and the follow- 
ing sections describe each step. 

2.3.1 Data preproces.sing and gene-level statistics Expression level 
was defined by using the robust multichip average (Irizarry et al, 2003). 
For datasets using different microarrays, only those with probes in 
common from Affymetrix U133A to Affymetrix U133Plus 2.0 were 
used for further analysis. For individual tumor cases, we performed quan- 
tile normalization (Bolstad et al, 2003) after combining the single tumor 
microarray with all nRef samples. In cases of genes with multiple probes, 
gene expression level was summarized by averaging probe-level expres- 
sion. Individual tumor sample gene expression was standardized using the 
mean and standard deviation of the reference. 

2.3.2 Pathway-level statistics and significance test We introduce 
five methods as candidates for iPAS. Each method is a modification of 
existing pathway analysis techniques, enabling us to test an individual 
tumor sample's pathway aberrance by using the nRef. A suminary is 
provided in Table 1. 
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Fig. 1. Schematic description of individualized pathway analysis using accumulated normal data (nRef). An individual's tumor data are normalized with 
the nRef. Gene expression is standardized by mean and SD of the nRef The iPAS is calculated from standardized gene expression values in the pathway. 
Null distribution calculated from the nRef provides significance 



Table 1. Modification of existing pathway analysis methods for iPAS 



Method 


Gene statistics 


Pathway statistics 


Average Z 


. _ gT, - meanfenRef) 


n 


Fisher 


. _ gji - '"«'«tenRef) 
Stder(gaRef) 

DEG: top 2.5%, 
bottom 2.5% 


/ |DEG| \ /N-IDEGI \ 
min(|DEG|,n) / ' ' \l '\ 

- C) 



max( />!,], - P^i.s) 



GSEA ^ gT, - mean{g„Ra) ip ■\= \^ 1 

*?*v(g„Ref) ^ - Nh) 



Euclidean 


gji - o!ean(g„Ref) 






stdexigaKcf) 


n 


Mahalanobis 


. _ gii - mean{gaRs[) 


Vz5-'z 




stdev(g„Rcf) 


n 



Note. Significance can be obtained against the null distribution generated from 
normal samples. All the collected normal samples for the nRef are one by one 
compared with the nRef to yield statistics of the null distribution. A statistic from 
a single cancer case is compared with this null distribution to yield /"-value. 



Average Z Standardizing the gene expression by inean and standard 
deviation from datasets is often used in microarray analysis. A vector 
Z= (zi, Z2, :„) denotes the expression status of a pathway where 
r; symbolizes the standardized expression value of (-th gene, where 
the number of genes belonging to the pathway is it. In typical settings, 
standardization is performed using the mean and SD of a given dataset, 



inostly the cancer-only cohort data, thus y/n indicates how inuch the 
given sainple's overall pathway gene expression deviates froin the 
center of the cancer samples. We made the simple inodification, 
Z' = {zy, zi;..., r„), where r;- is derived from inean and SD of the 
nRef In this case, y'/'i gives the samples deviation from the nRef 
We beheve this modification is biologically valid because every cancer 
starts its malignancy from normal cell. Thus, the chnical characteristics of 
a single cancer can be captured by measuring the difference of it against 
common characteristic of normal cells, which is represented by the nRef 
in our study. 

Fisher exact test We generated a 2 x 2 contingency table for a given 
pathway (S) and DEGs (D) for the test. For individualized interpretation, 
we define D by the ranking of z-value, which is standardized gene expres- 
sion for the mean and SD of the nRef For each individual sample, 5% 
(highest 2.5% and lowest 2.5%) of the total genes are defined as D. We 
apphed a two-tailed test to detect alteration of pathways due to enrich- 
ment or depletion of differential genes. The result of this statistic can be 
interpreted as how inany DEGs are enriched in the given pathway, where 
the expression difference refers to how much a patient's gene expression 
deviates from the nRef. 

Gene set enrichment analysis We adopted the original version of 
gene set enrichment analysis (GSEA) proposed by Subramanian et al. 
(2005). Typically, inputs for GSEA are generated by testing whole cohort 
samples using phenotype label; t-statistic, correlation coefficients and fold 
changes are usually used. In the personalized analysis setting, we use the 
z-value as an input for the GSEA algorithm, which is standardized gene 
expression by mean and SD of the nRef The GSEA output enrichment 
score reflects the degree to which a gene set in the pathway is oven'epre- 
sented at the extremes (low or high) of the entire ranked list of z-values 
from a single patient. 

Non-parametric quadratic test Gene expression in a pathway of a 
tumor sample is represented by vector Z = (zi, Z2, . . . , z„), where z; is 
standardized expression level of /-th gene by mean and SD of the nRef 
where n is the number of genes belonged to the pathway. A pathway 
characteristic of an individual patient's pathway can be represented by 
the averaged Euclidean distance (Z^Z/n). This gives the distance of a 
single patient from the center of the nRef due to the square of standar- 
dized expression difference, and thus does not reflect increased or 
decreased expression, only the extent of the expression difference. 
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Genes in the pathway are usually functionally correlated; therefore, use of 
the correlation structure of the normal samples may increase sensitivity 
enough to capture the aberrance of a single cancer case. We also consider 
the averaged Mahalanobis (Z^SZ/n) distance, which uses the co variance 
matrix calculated from the nRef This value describes the statistical dis- 
tance from the center of normal samples taking into account correlation 
among normal samples. The covariance matrix S is calculated for each 
pathway from the nRef 



3 RESULTS 

3.1 Pathway-based identification and validation of cancer 
survival 

To assess whether our method can sensitively detect pathway 
aberrances that are associated with a patient's clinical outcome, 
a known survival pathway that showed strong association with 
patient survival from Beer's data was tested. Bryant et al. (2010) 
reported that the 'cell cycle stimulatory' pathway of 51 genes is 
significantly associated with patient survival (Cox proportional 
hazards model, P = 0.000113). In that study, pathway gene 
expression was represented as an average of z-values, where the 
z-value indicates the standardized expression level, by the mean 
and SD, of all cancer samples. The high-risk group was defined 
as those in which pathway expressions were >0, and the pathway 
showed poor prognostic outcome. The association was signifi- 
cant with or without adjusted clinical covariates, and thus the 
pathway alone is a strong indicator of cancer prognosis. This 
finding was also validated in the Japanese LUAD cohort 
{n = 87, survival data are not provided to public) in Bryant's 
study. As studies have shown a clear association between the 
cell cycle pathway and cancer, in terms of driving cancer prolif- 
eration, we considered this pathway as a pathway that should be 
detected. All of the methods proposed as candidates for iPAS 
showed significant associations of the 'cell cycle stimulatory' 
pathway from Beer's data (Table 2). The same pathway analyzed 
using GSE8894 (n = 61) data yielded significant associations in 



Table 2. Survival analysis of 'cell cycle stimulatory' pathway reported by 
Bryant et al. (2010) 



Dataset 


Pathway 


Coefficient 


/"-value 




statistics 






Beer {N = 432), Bryant et al.. 


Average Z" 


0.37 


0.00011 


Overall survival 








Beer {N = 442) 


Average Z'' 


0.62 


0.00003 


Overall survival 


Fisher 


0.50 


0.00068 




GSEA 


0.65 


0.00001 




Euclidean 


0.65 


0.00001 




Mahalanobis 


0.67 


0.00001 


GSE8894 (N = 61) 


Average Z*" 


0.90 


0.01163 


Recurrent free survival 


Fisher 


0.91 


0.01076 




GSEA 


0.78 


0.02899 




Euclidean 


0.87 


0.01544 




Mahalanobis 


0.68 


0.05485 



''Derived from mean and SD of all cancer samples in the dataset, '^Derived by mean 
and SD of llie nRef. 



all proposed methods with the marginal exception of 
Mahalanobis (P = 0.0549). 

Prognostic gene expression signatures for Stages II and III 
colon cancers have been reported in seven papers, yielding 207 
genes in total (Bandres et al., 2007; Barrier et al., 2006, 2007; 
Eschrich et al, 2005; Kopetz and Abbruzzese, 2009; Lin et al., 
2007; Wang et al., 2004). The genes are enriched in 32 
REACTOME pathways (False Discovery Rate (FDR) < 0.05, 
pathway size < 96). We assumed the 32 pathways were valid as 
ground truth to be identified and analyzed in the colon cancer 
dataset GSE39585 (Stages II and III were only considered). 
Average Z provided best performer (sensitivity = 0.88) with 28 
pathways deemed as significant. GSEA, Fisher, Euclidean and 
Mahalanobis gave the following values, 0.78, 0.66, 0.06 and 0.03, 
respectively. 

These results satisfied us that our approach captures the fun- 
damental knowledge of cancer, thus it is reasonably considered 
as iPAS. 

To investigate which of the candidates for iPAS most robustly 
reflect phenotype association, we evaluated the proposed meth- 
ods by determining whether survival-associated pathways are 
validated in datasets never used for discovery using LUAD 
and colon cancer [LUAD: Beer's set n = 442 for discovery, 
GSE8894 (« = 61) GSE3141 {n = 58) for validation; colon 
cancer: GSE39582d (n = 443) for discovery, GSE39582v 
{n = 123) and GSE17536 (« = 109) for validation, logrank 
P<Q.Q5, comparing tumors in the top 50th percentile of aber- 
rance scores to those in the bottom 50th percentile]. Validation 
rates varied depending on the dataset, and these were possibly 
affected by the small sample size compared with that of the 
discovery set. In these cases, we were not able to determine a 
superior method that outperformed the others. Average Z gave 
the highest validation rate in three of four dataset with validation 
rates of GSE8894 (43.6%, 92/211), GSE3I41 (13.3%, 28/211) 
and GSEI7536 (10.7%, 24/224). When validation rates from 
four datasets are averaged. Average Z gave the highest validation 
rate, (21.9%), Fig. 2, blue bars). Pathways validated as signifi- 
cantly associated with patient survival for each cancer are listed 
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Fig. 2. Averaged validation rate of discovered survival-related pathway 
at four datasets. Proposed approach using nRef (blue) versus conven- 
tional approach that standardizes individual sample by mean and SD 
of entire cohort dataset (red) 
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in the Supplementary Materials (Supplementary Tables SI 
and S2). 

We also investigated the validation rate of iPAS candidates 
under the conditions where the same data are not standardized 
by the nRef but instead standardized by the mean and SD of 
the cohort dataset, which consists of only cancers (Fig. 2, red 
bars). It is noteworthy that use of the nRef increased the valid- 
ation rate for every iPAS candidate investigated. This implies 
that the strategy of using accumulated normal samples as a 
reference is beneficial in terms of pathway-based survival 
analysis. 

3.2 Identification of clinical importance 

Cluster analysis of using Average Z as the iPAS method on 
Beer's data identified 12 pathway clusters (denoted by 1~I2 
in Fig. 3) and 3 sample clusters (S2~S4; SI is from the nRef; 
Fig. 3). Sample clusters S2 and S4 represent well the differen- 
tiation status of LUAD (Fisher exact test, P<4.65 x 10"'^). 
Well-differentiated adenocarcinoma resembles the normal glan- 
dular structure; therefore, it is a reasonable result that cluster S2 
is close to the nRef. The survival outcome of S2 and S4 are 
significantly different (/"< 0.0028), and this assures us that un- 
biased clustering-based iPAS has enough sensitivity to capture 
clinically important associations. This finding is concordant with 
prior knowledge that well-differentiated LUAD patients are 
likely to have better prognosis (Barletta et al., 2010). Pathway 
cluster P9 is distinguished as commonly upregulated in tumor 
samples. The pathways are transfer RNA aminoacylation, amino 
acid or purine synthesis, DNA elongation and the extension of 
telomeres. 

Unbiased pathway-based clustering of colon cancer data also 
captures chnically important associations by revealing sample 
clusters that are survival related (S2 and S3, P = 0.0037, 
Supplementary Fig. SI). It is important to note that iPAS is 
not only sensitive enough to identify clinically meaningful sub- 
structure of patients but also reveals common characteristics of a 



cancer at the same time. For example, pathways commonly up- 
or downregulated in all cancer samples, for example, P9 or P2, 
would have not been discovered if the analysis had been per- 
formed by a conventional approach that does not make use of 
'nRef (Supplementary Fig. S2). 

3.3 Pathway-based identification of cancer 

Cancer develops unique mechanisms for malignancy. Therefore, 
it is reasonable to believe that identifying the unique molecular 
aberrances of cancer will aid in cancer diagnosis. Our empirical 
study of iPAS-based clustering of LUAD revealed several 
pathways commonly up- or downregulated in all of the cancer 
samples. Further analysis was performed to determine whether 
iPAS could be successfully used in the accurate identification of 
cancer. We tested this in a simple unsupervised way by judging 
whether an unknown sample is significantly different against the 
nRef, as a tumor, if not as normal. We performed a 5-fold cross- 
validation one hundred times with the LUAD dataset, which 
consisted of 120 cancers and 120 normal samples. Microarray 
data from the normal samples was randomly divided into five 
groups, and four of the five served as the reference group. The 
remaining group was used as the true normal set for the test of 
pathway-based identification of cancer. To build true cancer set 
for the test, the same number of cancer sample was randomly 
picked. We considered 583 pathways in REACTOME, giving 
293 500 (583 pathways x 5-fold x 100 repeats) AUCs and accur- 
acy values. We averaged AUCs and accuracies from the five 
candidate methods for iPAS and used this as a representative 
AUC and accuracy of a given pathway. 

By ranking the pathways by AUC, top pathways that marked 
averagely high performance by all iPAS candidates are listed 
(Supplementary Table S3). The 'amino acid synthesis and in- 
terconversion and transamination' pathway showed the highest 
classification performance. Unsurprisingly, this pathway was one 
of the commonly upregulated pathways in the analysis of the 
Beer's data (Fig. 3, pathway cluster P9). Among the tested 




Fig. 3. Clustered iPAS of LUAD dataset. Pathways (n = 583) and samples (;; = 442) are clustered according to iPAS. Normal samples are clustered at 
left (SI). Tumors (S2~S4) deviate from normal in both up- and downregulated directions (darker red and blue, respectively). Sample clusters are well- 
representing histopathological differentiation status (S2: for well-differentiated LUAD, P<4.65 x 10"'^^) and overall survival 
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Fig. 4. (a) Expression pattern of genes in the pathway. Each hne represents sample, (gray: normal, red: tumor). Dashed line represents expression value 
deviated 1 .96a from the mean expression value of normal tissues, (b). Performance of classification of cancer by 'amino acid synthesis and intercon- 
version and transamination'. AUC of 0.980 has marked in discovery set (95% confidence interval provided as error bar), independent validation set 
results AUC of 0.982 (Validation 1: normal samples in validation set served as reference) and 0.982 (Validation 2: normal samples in discovery set served 
as reference) 



-•-AverageZ -■-GSEA -*-Fisher -•-Euelidam -r- Mahalanobis 




b) 

1.0 
0.9 
0.S 
0.7 
0.6 



120(100%) 



90(75*4) 



60 {50%) 



120(100%) 



90(75%) 



SO (50%) 



Fig. 5. Performance of pathway-based identification of cancer (AUC) 
when only a subset of normal samples are served as nRef (a) 'amino 
acid synthesis and interconversion and transamination' (b) 'unwind of 
DNA' 



iPAS candidates for this pathway, Mahalanobis yielded the high- 
est AUC (0.980), while Average Z gave 0.936, and Fisher's exact 
test gave the lowest value (0.914). The standardized gene expres- 
sion pattern for this pathway differed between tumor and 
normal. Many of the genes deviated from mean of the nRef, 
by more than two orders of signia, contributing to its best per- 
formance out of all iPAS candidate methods, including ORA 
method like Fisher's exact test (Fig. 4a). 

We also analyzed the influence of using the subset of normal 
samples as nRef. We compared the pathway-based cancer iden- 
tification results using the full set of normal samples (n = 120) 
against 100 different runs using 75% (n = 90), 50% (n = 60) of 
randomly chosen normal samples. Among the pathways that 
marked averagely high performance in the identification of 



cancer, the best and the second best pathways are considered, 
'amino acid synthesis and interconversion and transamination' 
and 'unwind of DNA', respectively. The result shows little loss of 
performance, even though only a half of nomial samples were 
used for the test (Fig. 5a and b). 

3.4 Validation of the discovered pathway 

The 'amino acid synthesis and interconversion and transamin- 
ation' pathway consists of 17 genes involved in three major 
reactions, as it is described at REACTOME. The pathways are 
responsible for (i) synthesis of three amino acids (aspartate, 
asparagine, glutamate), (ii) the synthesis of glucose under fasting 
conditions by using carbon atoms from these four amino acids 
and (iii) conversion of amino acids to their corresponding alpha- 
keto acids, coupled to their conversion to glutamate, which is the 
first step in the catabolisni of most amino acids. 

This function makes sense in terms of the 'glutamine addic- 
tion' of cancer cells. The nutrients glucose and glutamine are 
specifically required by cancer cells as metabolites for growth 
and for production of adenosine triphosphate (Munoz-Pinedo 
et al., 2012). Myc and p53 have been revealed to be associated 
with this 'addiction' by upregulating glutamine synthesis in 
cancer cells. Thus, our finding is in accordance with prior know- 
ledge regarding the upregulation of glutamine synthetase. 

We further validated our findings with an independent set that 
were not used in the discovery set. We collected two more LUAD 
gene expression datasets with normal data at GEO (GSE19188, 
GSE31547). Aggregated datasets of 48 microarrays from tumor 
tissues and 35 microarrays from normal tissues were used for 
independent validation. The pathway was also altered in a can- 
cer-specific way in a validation set yielding an AUC of 0.982 by 
Mahalanobis-based iPAS (Fig. 4b, Validation 1). We also as- 
sessed the same validation set in a different manner by using 
the nRef from the discovery set. Normal sample microarrays 
from the discovery sets (GSE10082, GSE7670, GSE10072) 
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served as the nRef to classify samples in the independent valid- 
ation set. The resulting AUC was 0.982 by the Mahalanobis 
method (Fig. 4b, Validation 2). 

In our experiments using LUAD samples, the Mahalanobis 
distance, which used a pre-calculated covariance matrix from 
the 'nRef,' gave the best performance. The usage of covariance 
matrix empowers Mahalanobis to consider a cancer sample as an 
outlier, delivering higher accuracy in terms of pathway-based 
identification of cancer than other methods. One caution of 
using Mahalanobis method is that it requires a large number 
of nomal samples to guarantee the estimation of covariance 
matrix. For a small sample size, a structured covariance matrix 
would be desirable to avoid the estimation issue. 

The biological role of this identified pathway is to supply nu- 
trients and energy to cancer cells. This may be the reason why 
this pathway is universally aberrant in all the LUAD samples we 
assessed. Our analysis of this pathway in other cancer types 
demonstrated less of a role for this pathway, suggesting that it 
is more LUAD specific. We believe that the common disruption 
of this pathway is a novel discovery, as this pathway, consisting 
of 17 genes, has not been reported as an indicator of LUAD in 
any of the studies we acquired datasets from (GSE 10082, 
GSE7670, GSE10072), nor in a literature search with key words. 

4 CONCLUSIONS 

We have proposed personalized extensions to ORA- and FCS- 
based pathway analysis by introducing the concept of comparing 
an individual tumor with many normal samples. Exploratory 
analyses of our methods with previously published survival path- 
way signatures reproduced the correct survival outcomes. We 
have also demonstrated that using nRef improves the validation 
rate. Unbiased clustering with iPASs revealed sample clustering, 
which is indicative of the cancer differentiation status of LUAD 
and of different survival outcomes. Clustering also identifies 
pathway characteristics from patients displaying common up- 
or downregulations and subgroup-specific deregulations. 

Pathways that are commonly deregulated across all cancer 
patients may be useful in identifying cancer from unknown 
samples. We explored the pathway-based identification of 
cancer with 'amino acid synthesis and interconversion and trans- 
amination' pathway, which is commonly upregulated in LUAD 
patients. Validation using independent datasets demonstrated 
that this pathway is useful in classifying LUAD and normal 
lung samples. 

Based on our results, we conclude that individualized pathway 
scores using nRef can provide a sensitive measure of a patient's 
clinical features and can be useful for identifying cancer. 

In our empirical study. Average Z performed best in highlight- 
ing pathway aberrance and in further revealing clinical import- 
ance. It had the best statistical power when identifying a 
previously known survival-related pathway and the best aver- 
aged validation rate for LUAD and colon cancer. In the path- 
way-based identification of cancer, the Mahalanobis method 
performed best. 

An important clinical aspect of our methods is that it enables 
the interpretation of a cancer case in a single patient, even if 
matched normal tissue data from the same individual are 
unavailable. Accumulated information of normal tissues from 



a data repository will take the place of data unavailable for a 
specific individual. As the data repository is growing rapidly, it is 
expected that more 'nRef data will be available for many 
diseases in the near future. We hope that our proposed approach 
can help in the personalized interpretation of tumor data and can 
be a useful tool in the upcoming era of data-based personalized 
medicine. 

Although we have shown our results in microarray platform, 
our method is applicable to different RNA expression platforms 
including next-generation sequencer. Our method is also sup- 
portive of various pathway resources such as KEGG, NCI 
cancer pathway and Biocarta provided in the gmt format. The 
R code for our methods, along with nRefs of LUAD and colon 
cancer used in our study, is available at http://bibs.snu.ac.kr/ipas. 
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